
Conversation


@qiruiyangmeta qiruiyangmeta commented Oct 1, 2025

Add utility functions to enable load-balanced token sharding for context parallelism.

Purpose

Causal attention imposes a varying computational load per token, as shown in the figure below: later tokens attend over longer prefixes and are therefore more expensive. To distribute the workload evenly, tokens are partitioned across context parallelism (CP) ranks: the sequence is split into 2 × cp_world_size chunks, and each CP rank i is assigned both the i-th chunk and the (2 × cp_world_size - i - 1)-th chunk, pairing a cheap chunk with an expensive one. This balances the compute load across all CP ranks.
[Figure: per-token compute load under causal attention]
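
A minimal sketch of this chunk assignment (hypothetical helper; the actual utilities live in vllm/v1/attention/backends/cp_utils.py and may differ in details):

    def assign_chunks(seq_len: int, cp_world_size: int) -> list[list[int]]:
        """Split token indices into 2 * cp_world_size chunks and give rank i
        both chunk i and chunk (2 * cp_world_size - i - 1)."""
        num_chunks = 2 * cp_world_size
        chunk_size = (seq_len + num_chunks - 1) // num_chunks
        chunks = [list(range(start, min(start + chunk_size, seq_len)))
                  for start in range(0, seq_len, chunk_size)]
        chunks += [[] for _ in range(num_chunks - len(chunks))]  # pad short sequences
        return [chunks[rank] + chunks[num_chunks - rank - 1]
                for rank in range(cp_world_size)]

    # With 16 tokens and cp_world_size = 2:
    #   rank 0 -> tokens [0, 1, 2, 3] + [12, 13, 14, 15]  (lightest + heaviest chunk)
    #   rank 1 -> tokens [4, 5, 6, 7] + [8, 9, 10, 11]
    print(assign_chunks(16, 2))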

When tokens are distributed across context parallel (CP) ranks, gaps may appear in the block table. After compaction, tokens that are physically adjacent may no longer be logically consecutive. This is acceptable for CP because we only need to preserve the correct relative order of tokens for mapping purposes, not their absolute positions in the block table.
[Figure: block-table gaps and compaction across CP ranks]
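
A rough illustration of why relative order is sufficient (hypothetical per-rank row, not the actual vLLM block-table data structure):

    def compact(row: list[int | None]) -> list[int]:
        # Drop the gaps left by positions owned by other CP ranks; the surviving
        # entries keep their original relative order, but not their absolute positions.
        return [tok for tok in row if tok is not None]

    # Rank 0 owns logical tokens 0-3 and 12-15 of a 16-token sequence; the
    # positions owned by other ranks appear as gaps (None) before compaction.
    row = [0, 1, 2, 3] + [None] * 8 + [12, 13, 14, 15]
    print(compact(row))  # [0, 1, 2, 3, 12, 13, 14, 15]; 3 and 12 are now physically adjacent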

Test Plan

pytest tests/v1/attention/test_context_parallel_attention.py

End-to-end tests will be added in follow-up PRs.

Test Result

======================================================= test session starts =======================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/qiruiyang/vllm
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 3 items                                                                                                                 

tests/v1/attention/test_context_parallel_attention.py ...                                                                   [100%]

======================================================== 3 passed in 0.51s ========================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces utility functions for token sharding to enable context parallelism, which is a significant feature for improving performance. The changes correctly add the context_parallel_size configuration and update the distributed state management accordingly. The core logic for sharding is well-encapsulated in the new vllm/v1/attention/backends/cp_utils.py file. However, I've identified a couple of issues in the new test file, tests/v1/attention/test_context_parallel_attention.py, including a critical bug in an assertion that needs to be fixed for the tests to be valid.

Comment on lines +216 to +218
    assert num_comp_local == [
        num_computed_tokens[0][-1] // 2, [num_computed_tokens[1][-1] // 2]
    ]

critical

There appears to be a bug in this assertion. The expected value for num_comp_local should be a list of integers, but the expression [num_computed_tokens[1][-1] // 2] creates a list as the second element, resulting in [5, [4]]. The actual value of num_comp_local is [5, 4], which will cause this assertion to fail. For clarity and correctness, it's better to assert against the hardcoded expected value.

    assert num_comp_local == [5, 4]


    def make_cached_request_state(id: int, prefill_len: int, decode_len: int,
                                  num_computed_tokens: list[int]):
        assert prefill_len + decode_len == sum(num_computed_tokens)

high

The assertion in this helper function is incorrect. num_computed_tokens is a list of cumulative token counts, so sum(num_computed_tokens) does not represent the total number of tokens. The total number of tokens is the last element of the list. The assertion should be assert prefill_len + decode_len == num_computed_tokens[-1].

Suggested change

    -    assert prefill_len + decode_len == sum(num_computed_tokens)
    +    assert prefill_len + decode_len == num_computed_tokens[-1]
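
For concreteness, a tiny hypothetical example of the difference (values not taken from the test):

    num_computed_tokens = [2, 5, 9]         # cumulative counts after each step
    assert sum(num_computed_tokens) == 16   # not the total number of tokens
    assert num_computed_tokens[-1] == 9     # the actual total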

@qiruiyangmeta qiruiyangmeta force-pushed the prepare_inputs_for_cp branch from d07fd25 to f6b6ed3 on October 3, 2025 at 02:54

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @qiruiyangmeta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025